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ABSTRACT 

The finished human genome-assemblies comprise 
several hundred un-sequenced euchromatic gaps, 
which may be rich in long polypurine/polypyrimidine 
stretches. Human chromosome 20 (chr 20) currently 
has three unfinished gaps remaining on its q-arm. 
All three gaps are within gene-dense regions and/or 
overlap disease-associated loci, including the 
DLGAP4 locus. In this study, we sequenced ~99% 
of all three unfinished gaps on human chr 20, 
determined their complete genomic sizes and 
assessed epigenetic profiles using a combination 
of Sanger sequencing, mate pair paired-end 
high-throughput sequencing and chromatin, methy- 
lation and expression analyses. We found histone 3 
trimethylated at Lysine 27 to be distributed across 
all three gaps in immortalized B-lymphocytes. In one 
gap, five novel CpG islands were predominantly 
hypermethylated in genomic DNA from peripheral 
blood lymphocytes and human cerebellum. One of 
these CpG islands was differentially methylated and 
paternally hypermethylated. We found all chr 20 
gaps to comprise structured non-coding RNAs 
(ncRNAs) and to be conserved in primates. We 
verified expression for 13 candidate ncRNAs, some 
of which showed tissue specificity. Four ncRNAs ex- 
pressed within the gap at DLGAP4 show elevated 
expression in the human brain. Our data suggest 
that unfinished human genome gaps are likely to 
comprise numerous functional elements. 



INTRODUCTION 

Completion of the sequence analysis of the human genome 
provided in-depth characterization of the physical 
sequence of genomic DNA (gDNA), albeit with 
numerous unfinished gap regions (1,2). After almost 11 
years of post-analysis, these gap regions remain present 
and uncharacterized in both euchromatic and heterochro- 
matic regions of the human nuclear genome (3). Some of 
the euchromatic gaps are flanked by segmental duplica- 
tions and remain technically challenging to sequence. 
Alternative approaches previously used for closing some 
of these gaps elsewhere in the genome, include fosmid re- 
sources (4), transcript-based approaches between paired 
expressed sequence tags (5), chromosome walking and 
shotgun sequencing (6) and more recently sequencing by 
synthesis using next-generation high-throughput 454 
sequencing technology (7). None of these single 
approaches has been successful in closing all the remaining 
gaps in the human genome, possibly due to their unique 
structural or sequence identity. 

The sequence assembly of human chromosome 20 (chr 
20) currently has three remaining unfinished gaps, all 
located on the long arm (q-arm) of the chromosome (8). 
All three gaps are within gene-dense euchromatic areas. 
There are no segmental duplications up to a distance of 
> 100 kb on either side of these gaps. Since all three human 
chr 20 gaps lay in gene-dense areas or overlapped loci 
associated with human disorders (9,10), we hypothesized 
that their molecular and transcriptional characterizations 
will provide valuable insight into the regulation of sur- 
rounding regions. In this study, we have successfully 
sequenced ~99% of the unfinished sequences and 
determined the correct sizes of the three remaining gaps 
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on human chr 20. Furthermore, we characterized their 
chromatin and methylation compositions, assessed tran- 
scription (predicted and actual transcripts) and measured 
sequence and RNA structure conservation (Figure 1A). 

MATERIALS AND METHODS 

Cell culture 

Epstein-Barr virus (EBV) transformation of peripheral 
blood lymphocytes (PBLs) was performed using estab- 
lished protocols (11). Mouse-human cell hybrid cell lines 
containing only one single paternally or maternally 
derived human chr 20 were generated using mouse E2 
cells as published elsewhere (12). The maternally inherited 
chr 20 homolog was disrupted by a chromosomal 



translocation t(8:20)(pl2;ql 1.23) that we previously 
reported (13). 

Sanger sequencing 

Overlapping polymerase chain reaction (PCR) primers 
(Supplementary Table SI) were designed using Primer 3 
software (14). Touchdown PCR followed by Sanger 
sequencing (15), was adopted to sequence and close chr 
20 gaps using gDNA from human PBLs of multiple 
controls. 

Mate pair paired-end Illumina sequencing of gDNA 

Mate pair paired-end sequencing of multiple gDNA 
libraries was performed as previously published (16), 
using an Illumina-recommended protocol and sequenced 
on a Genome Analyzer II (Illumina). Genomic DNA was 
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Figure 1. Strategy followed to sequence and characterize previously unfinished human chr 20 gaps. (A) Schematic representation of experimental 
workflow and study design. (B) BLASTN analysis of three remaining human chr 20 gaps (1,2 and 3) in the HuRef, Celera and GRC37 (hgl9) 
assemblies. Stripped rectangles denote gap regions sequenced here. 



6662 Nucleic Acids Research, 2012, Vol. 40, No. 14 



obtained from PBLs of three balanced chromosomal 
translocation carriers (not used for PCR and Sanger 
sequencing) with intact chr 20 homologs. Fragmented 
gDNA was sonicated to sizes of 3-kb fragments. 

Immunoprecipitation of hypermethylated gDNA using 
his-MBD2b (MBD-Seq) 

Approximately 2-4 ug of gDNA was obtained from PBLs 
of either a translocation carrier (13), phenotypically 
normal controls or purchased normal human cerebellum 
(41 -year-old male donor, Cat # TCB-4098, Capital 
Biosciences) and sonicated/nebulized, followed by purifi- 
cation using QIAquick columns and purification kits 
(Qiagen). Immunoprecipitation (IP) was performed using 
a his-tagged MBD2b (termed MBD) antibody (0.1 ug/ul) 
and the MethylCollector kit (Cat # 5502, Active Motif) 
according to manufacturer's instructions followed by 
high-throughput sequencing. 

Methylation-sensitive restriction enzyme digestion 
sequencing of gDNA 

Restriction enzyme digests were performed using the 
methylation-sensitive 7/paII enzyme according to the 
manufacturer's instructions (New England Biolabs). 
Digested gDNA was size selected using 2% agarose gel 
electrophoresis. A region corresponding to 150-300 bp 
was excised and purified using a gel purification kit 
(Qiagen) followed by high-throughput sequencing. 

Chromatin IP sequencing 

Chromatin IP (ChIP) was performed using sonicated 
chromatin obtained from EBV-transformed PBLs of a 
translocation carrier (13) and antibodies for histone 3 
trimethylated at Lysine 27 (H3K27me3) (Active Motif, 
Cat. #39155), H3+ (Abeam, Cat #abl791) or IgG (Cell 
Signaling Technology Cat #2729S) as previously pub- 
lished (17) or using the ChlP-IT express kit (Cat # 
53008, Active Motif) followed by high-throughput 
sequencing. 

High-throughput Illumina sequencing of RNA sequencing 

High-throughput RNA sequencing (RNA-Seq) was per- 
formed using poly-A RNA from human hippocampus 
and temporal lobe (Ambion) tissues together with 
paired-end sequencing adapters, an Illumina-recom- 
mended sequencing protocol and run on a Genome 
Analyzer II (Illumina). In addition, the Body Map 2.0 
sequence data archived at the European Nucleotide 
Archive (ENA) under accession ERP000546 was used 
for RNA-Seq analysis of other human tissues as described 
elsewhere (18). 

High-throughput Illumina sequencing 

Library preparation of gDNA obtained from MBD-Seq, 
methylation-sensitive restriction enzyme sequencing 
(MRE-Seq) and ChlP-Seq experiments was performed 
using single read or paired-end adapters and sequenced 
on a Genome Analyzer II according to manufacturer's 
instructions (Illumina). ENCODE raw sequenced reads 



for H3K27me3 ChIP on GM 12878, EBV-PBLs were 
obtained from UCSC genome browser (genome.ucsc.edu) 
(19). Unique sequenced reads from all high-throughput 
experiments including ENCODE H3K27me3 ChlP-Seq 
experiments were re-aligned to an altered hgl8 assembly 
that included sequence newly identified in this study for all 
three chr 20 gaps (termed hgl8 + chr 20 gaps genome 
assembly) using Bowtie (20). The 709 bp starting from 
the 5'-end of gap 2 was not included in the re-alignment 
assembly (hgl8 + chr 20 gaps) file so as to not change the 
genome base positions (chromosome offsets) on either side 
of the gap 2 during re-alignment. Importantly, no signifi- 
cant reads were shown to align to this region in any 
ChlP-Seq, MBD-Seq or MRE-Seq experiment. The 
required numbers of unsequenced nucleotides (NNNs) 
were maintained in gaps 1 and 3 as they were estimated 
to be 20 and 110 kb, respectively, in the hgl8 human- 
genome assembly. MBD-Seq, MRE-Seq and ChlP-Seq 
reads were analyzed using Cisgenome software as 
published elsewhere using a false discovery rate or 
<0.01 (21). ChlP-Seq data were analyzed using H3+ or 
IgG as negative controls. Paired-end RNA-Seq reads had 
a maximum distance of ~5 kb between each pair. 

In situ hybridizations 

In situ hybridizations were performed using purchased 
control human cerebellum brain sections (70-year-old 
female donor, Cat # TCB-4089 Capital Biosciences) as 
previously published (22) using Fluorescein-labeled oligos 
(Supplementary Table S7). Signals were analyzed using an 
MVX10 microscope, F-View black/white and CelLP 
software (Olympus). 

Bisulfite allelic sequencing 

Bisulfite conversion of gDNA was performed using the 
Methylation Gold kit (Zymo Research) according to 
manufacturer's instructions. Bisulfite allelic sequencing 
was performed using primers listed in Supplementary 
Table SI, Methyl-specific PCR, touchdown PCR condi- 
tions (15,23) and Sanger sequenced on an ABI 3730 x 1 
DNA analyzer (Applied Biosystems). Sequence output 
was analyzed using Chromas software (Technelysium) 
and BLASTN (24). 

Expression analysis 

Complementary DNA (cDNA) was prepared from total 
RNA obtained from a human multi-tissue panel 
(NSGene) and human brain panel (BioCat), using 
Superscript II Reverse Transcriptase (Invitrogen). Real 
time PCR (RT-PCR) was performed with multiple 
batches of cDNA samples, primers listed in 
Supplementary Table SI and run on the Opticon 2 
thermocycler (Bio-Rad) using FastStart DNA 
Masterplus SYBR green I (Roche). RT-PCR using 
cDNA prepared from RNA of a human multi-region 
brain panel (BioCat) was performed in duplicates or trip- 
licates. Data were quantified and normalized using the 
geometric mean normalization factor as published else- 
where (25) and housekeeping genes: GAPDH, ATP5A1, 
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C0X4A. ATP6A, ALAS1, B2M, ATP8A, PBGD, G6PD 
and HPRT (Supplementary Table SI). 

Northern blot analysis 

Northern blot analysis was performed using purchased 
total human cerebellum RNA of a 26-year-old normal 
human male (Cat # 4099, Capital Biosciences). 
Amplified PCR products were purified using the 
QIAquick purification kit and columns (Qiagen). 
Northern blotting was performed using methods pub- 
lished previously (26,27) using purified PCR products 
end labeled with T4 PNK and [y- 32 P] ATP or body 
labeled by the random primer method and [a- 32 P] ATP 
(specific activity >3000 Ci/mmol, PerkinElmer). 

CpG island prediction 

CpG islands (Supplementary Table S3) were predicted 
using previously established criteria (28,29) and the CpG 
island Searcher tool (30). 

Visualization of aligned high-throughput sequenced 
reads in chr 20 gaps 

The graphical user interface Map View (31), Cisgenome 
(21) and/or the genome browser at UCSC (19) were 
used to visualize short-read data for each of the gap 
regions at nucleotide level, using high-throughput 
MRE-Seq MBD-Seq, ChlP-Seq, RNA-Seq and mate 
pair paired-end sequenced data sets. 

Conservation analysis of human chr 20 gaps 

The gap regions and 23 other genome assemblies were 
aligned using the lastz alignment (version 1.02, gap open 
penalty of O = 400 and gap extension penalty of E = 3) 
(32). Lastz output was chained by the axtChain program 
(minimum chain score of 3000 and 'medium' linear gap 
matrix) and chained alignments were processed into nets 
by chainNet and netSyntenic, which are all available from 
Webb Miller's lab at Penn State University (http://www 
.bx.psu.edu/miller_lab/). Multiple alignments were created 
with roast from the tba/multiz alignment program (33) 
version 012909 also available from Webb Miller's lab. A 
dynamic programming range of 30 and a minimum block 
size of size 1 were employed. Phylogenetic trees were first 
estimated by the neighbor joining algorithm under the 
K80 distance model. The branch lengths were 
maximum-likelihood estimated using the HKY85 substi- 
tution model. The evolutionarily conserved elements in 
the multiple alignments of the three gaps are measured 
by phastCons scores (34). We let phastCons (35) itself 
estimate the phylogenetic models (conserved and 
non-conserved) directly from the 24-way multiple align- 
ments and the previously calculated neighbor joining 
trees (phastCons' option 1). 

Prediction of structured RNAs 

We searched the gap regions that were not repeat masked 
for known non-coding RNAs (Rfam) (36) by using 
INFERNAL (£<1E _S ) (37) and BLAST CE^IE" 10 ). 
We applied CMfinder (38) to simultaneously predict 



alignment and consensus secondary structures of 
syntenic sequences. The resulting RNA structural align- 
ments are described by covariance models (CMs) (39) 
that can be searched by INFERNAL. First, we screened 
the gap regions for structured RNA motifs previously pre- 
dicted by CMfinder (38) using INFERNAL (£<1E -6 ). 
These motifs were found by a human genome-wide 
screen similar to the screen of the ENCODE regions 
(40) and ran conserved in a pool of 17 vertebrates. 
Second, we run CMfinder on the 24-way multiple align- 
ments of the gap regions (P > 60). 

RESULTS 

Location and Sanger sequencing of previously unfinished 
gaps on the long arm of human chr 20 

Starting from the centromere on the long arm of human chr 
20, the first unfinished gap is located within the DLGAP4/ 
SAPAP4 [discs, large {Drosophila) homolog-associated 
protein 4 or PSD-95/SAP90-associated protein 4] at 
position chr 20:34360500-34380499 (NCBI build 36: 
hgl8) (Supplementary Figure SI A). The second gap (chr 
20:60524833-60551882, hgl8) is distal to the C20orf200\ 
NCRNA00335 (Chr20 Open Reading Frame (ORF) 200/ 
Non-coding RNA 335) (Supplementary Figure SIB). 
Whereas the third gap (chr 20:60623815-60733814, hgl8) 
is distal to the SLC04A1 (Solute Carrier Organic anion 
transporter family, member 4 Al) (Supplementary Figure 
SIC). From here on in this study each of these unfinished 
gaps on human chr 20 is annotated as gap 1, gap 2 and gap 
3, respectively. According to the hgl 8 genome assembly the 
three unfinished gaps are estimated to be 20 kb (gap 1), 
27 050 bp (gap 2) and 1 10 kb (gap 3), respectively. The 
recently released GRC37 (hgl9) genome assembly did not 
contain any new sequence representation for any of these 
three gaps on chr 20. The sizes of these chr 20 gaps accord- 
ing to the GRC37/hgl9 genome assembly were estimated to 
be 50 kb each and located at positions chr 
20:34897086-34947085 (gap 1), chr 20:61091438-61141437 
(gap 2), chr 20:61213370-61263369 (gap 3). Neither the 
hgl 8 nor the hgl 9 assemblies contain the complete 
gDNA sequence for any of the three remaining human 
chr 20 gaps. 

Here, we successfully obtained ~99% nucleotide 
sequence and identified the correct sizes for each of the 
three remaining gaps on human chr 20. We initially 
combined sequences available in the publicly released 
NCBI build 36/hgl8 and GRC37/hgl9 or the Celera and 
HuRef genome-assemblies using BLASTN (24) analysis of 
their flanking regions (Figure IB). Following this, we 
Sanger sequenced all three-gap regions on human chr 20 
in gDNA obtained from PBLs. We identified the complete 
and correct genomic sizes of all three gaps to be 9485 bp 
(gap 1), 27 759 bp (gap 2) and 16 383 bp (gap 3) 
(Supplementary Figure S1A-S1C), respectively. Starting 
from the known sequence outside each gap we designed 
overlapping PCR primers to amplify and sequence these 
gaps from the outside, going inwards. BLASTN analysis 
confirmed that most of the sequences obtained here, have 
matched with sequence already available in either the 
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Celera or HuRef assemblies or both, albeit for some dif- 
ferences in single nucleotides, possibly indicative of single 
nucleotide polymorphisms (Figure IB). None of the 
sequence we obtained is represented in neither hgl8 nor 
hgl9 genome assemblies. We also obtained novel sequence 
for some regions, previously not represented in either 
Celera or HuRef assemblies. Some regions that were 
indeed identified as unfinished in the Celera and HuRef 
assemblies were found to be artificial, suggesting that their 
flanking regions at either end were in fact joined. We 
found no evidence for segmental duplications on either 
side of these gaps up to a distance of <100kb. 

Mate pair paired-end Illumina sequencing of human 
chr 20 gaps 

Next, using paired-end sequencing of multiple mate pair 
human PBL gDNA libraries we confirmed sequence 
identity of all three human chr 20 gaps (Figure 2A-C). 
We used multiple gDNA samples, which were in 
addition to the ones used for Sanger sequencing. We 
modified the hgl8 genome assembly in a way that the 
gaps having the unsequenced nucleotides (NNN) on the 
long arm of chr 20 were replaced with sequence we 
obtained previously using Sanger sequencing, taking care 
not to change the genome base position (chromosome 
offsets) on either side of these gaps. We termed this 
altered human genome assembly: hgl8 + chr 20 gaps. We 
successfully obtained the complete sequence for gap 2 but 
were unable to generate sequence for ~348 bp in gap 1 and 
~413bp in gap 3 due to the highly repetitive nature of 
these regions as established using Sanger sequencing. 
Since we obtained an individual PCR product for these 
repetitive regions in gap 1 and gap 3, we were able to 
estimate their sizes but unable to generate complete nu- 
cleotide sequence (Supplementary Figure SI A and SIC). 
Using RepeatMasker (v3.3.0) (41) and the default human 
repeat database we found all the three gap regions on chr 
20 to have long interspersed nuclear elements, small 
interspersed nuclear elements, long transposon repeats 
and simple and low complexity repeats, yet at low copy 
numbers (Supplementary Table S2). Within gap 1 we 
obtained novel sequence for 1052 bp within a ~1400-bp 
region that was not present in any genome assembly. The 
sequence for the remaining ~348 bp contained within this 
region could not be generated due to a continuous TG 
repeat. 

H3K27me3 is distributed across all three human 
chr 20 gaps 

Following sequence analysis and identification of the 
actual sizes of the three chr 20 gaps we characterized 
their chromatin distribution. H3K27me3 is associated 
with repressed chromatin and is distributed on both the 
flanking sides of all three chr 20 gaps in most 
lineage-committed and cultured immortalized human 
somatic cells (42). Using ChlP-Seq we found H3K27me3 
to be distributed across all three-gap regions in EBV- 
PBLs (Figure 3A), suggesting them to be transcriptionally 
silenced. H3K27me3 distribution across all the three gaps 



was identified at the higher end of H3K27me3 distribution 
for the genome. 

Differentially methylated CpG island in human 
chr 20 gap 2 

Next, we assessed gDNA methylation (Ch 3 ~) at the 
carbon-5 residue of the cytosine base (termed 5mC) of 
CpG dinucleotides across all three human chr 20 gaps. 
Using MBD-Seq and MRE-Seq we found no widespread 
enrichment of gDNA hypermethylation at CpG dinucleo- 
tides in either gap 1 or gap 3 in PBLs or human cerebellum 
(Figure 3B). However, we found hypomethylation and 
hypermethylation peaks to be widespread across gap 2 
in gDNA from both PBLs and human cerebellum 
(Figure 3C). We identified five new CpG islands within 
gap 2, including one lying at its 3'-terminal end that 
extended outside the gap (Figure 3C). We also used a 
more stringent classification that predominantly identifies 
CpG islands at the 5'-end of genes (28) and identified three 
CpG islands. These overlapped with the CpG islands pre- 
dicted earlier in gap 2 using the less stringent criteria 
(Figure 3C and Supplementary Table S3). Using 
MBD-Seq we found all CpG islands in gap 2 to be pre- 
dominantly enriched for hypermethylated CpGs 
(Figure 3C). However, we found CpG island 4/ la within 
gap 2 to have both hypomethylated and hypermethylated 
peaks (Figure 3C), demonstrating it to be a differentially 
methylated region (43). In order to assess whether CpG 
island 4/ la was maternally or paternally methylated we 
took advantage of our previously established mouse- 
human E2 cell hybrid cell lines. Each cell hybrid contained 
only a single paternally or maternally derived human chr 
20. Bisulfite allelic sequencing of gDNA isolated from 
these mouse-human cell hybrids showed CpG island 
4/la to be paternally hypermethylated (Figure 3D). As a 
control we tested the methylation status of the paternally 
expressed MCTS1 (PSIMCT-1) pseudogene and NNAT 
(Neuronatin) gene located on the q-arm of human chr 20. 
We confirmed them to be paternally hypomethylated and 
maternally hypermethylated in these mouse-human E2 
cells (Supplementary Figure S2, data not shown) similar 
to data reported elsewhere (44,45). 

Conservation and expression analysis of human 
chr 20 gaps 

Since the analyzed regions are not annotated as gaps in 
other organisms including primates we performed pairwise 
alignments (lastz) (32) of the three gap regions including 
500 bp up and downstream with several other species. We 
found all three gaps on human chr 20 to be well covered 
and conserved in primates but not in more distant 
mammals (Figure 4A-F). Furthermore, we found evolu- 
tionary conserved elements to be present in each human 
chr 20-gap region (Supplementary Figure S3A-S3C). The 
multiple alignments of the gap regions are slightly more 
conserved than the 17-way UCSC multiple alignments 
that cover around 50% of the human (hgl8) genome. 
This is mostly due to the primate specificity of gap 
regions and presence of many regions of low complexity 
within these gaps. In order to assess transcriptional 
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Figure 2. Mate pair paired-end Illumina sequencing of human chr 20 gap regions. (A-C) Representative UCSC genome browser views of mate pair 
paired-end Illumina sequenced unique reads aligned to the three human chr 20 gap regions using the edited hgl8 + chr 20 gaps human genome 
assembly. Overlapping mate pair paired-end reads align across all three chr 20 gap. Regions are drawn to scale without disturbing chromosome 
offsets. 



activity within the chr 20 gaps, we initially screened for 
ORFs using the ORF finder at NCBI (46) and a minimum 
gDNA region of >300bp. We predicted 6 ORFs in gap 1; 
38 ORFs in gap 2 and 24 ORFs in gap 3 (Supplementary 
Table S4). Most ORFs in gap 2 overlapped with 
hypermethylation peaks identified (Figure 3C and 
Supplementary Tables S4). 

Using an in house non-coding RNA (ncRNA) annota- 
tion pipeline (47) we searched for known as well as de novo 
predicted RNA structures. Structural homology to known 
ncRNAs from Rfam vlO.O (48) was searched using tools in 
the INFERNAL package. Structural similarity to known 
ncRNAs from Rfam vlO.O was searched using tools in the 
INFERNAL package and BLAST. INFERNAL tools 
were also used to search for structural similarity to 
CMfinder-generated models of de novo ncRNA structure 
predictions elsewhere in the genome. These CMfmder 
models comprise structural features conserved in several 
vertebrate genomes. In addition, the generated 24-way 
multiple alignments of the gap regions were screened by 
CMfmder to find models of novel structured RNA motifs. 
Expression from these gaps was subsequently assessed 



using the multi-tissue RNA-Seq Body map 2.0 data. 
Though there are no matches to known ncRNAs in 
these gap regions, we predicted 16 structured RNA 
motifs in the multiple alignments of the gap regions 
(Supplementary Tables S5) and 19 additional structured 
RNAs in fourteen regions similar to RNA structures else- 
where in the genome (Supplementary Tables S6) with 
variable sizes of <250nt. The amount of RNA structures 
is slightly lower in the gap regions compared to the 
conserved RNA structures in the human genome (pre- 
dicted by CMfmder with pscore >60; data not shown). 
Sixteen of the structured RNAs are <50nt (12 from the 
first and 4 from the second set of predictions) and 4 others 
originate from the same genomic locus. We chose to 
validate expression for the remaining 11 candidates with 
similar conserved structure elsewhere in the human 
genome (Supplementary Tables S7 and S8). These 
ncRNAs are mostly primate specific in the gap regions 
and thus predicted with low scores from the 24-way 
multiple alignments (due to short evolutionary 
distances). Three of these ncRNA candidates (numbers 5, 
6 and 9) are present in gap 1 and a set of four ncRNA 
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Figure 3. Chromatin distribution and DNA methylation profile across human chr 20-gap regions. (A) H3K27me3 is distributed across all three 
human chr-20 gap regions in EBV-PBLs. (B) MRE-Seq (hypomethylated) and MBD-Seq (hypermethylated) aligned peak reads within gaps 1 and 3. 
Methylation distribution is similar in PBL and human cerebellum for both gaps 1 and 3. (C) Methylation profile of gap 2 shows it to be enriched for 
both hypomethylated and hypermethylated CpGs, respectively. CpG island 4/ la (red rectangle) is enriched for both MBD-Seq and MRE-Seq reads 
suggesting it to be a differentially methylated locus. CpG islands 2-6 were annotated using the classification: 500-bp region of genomic DNA with 
>50% CG content and an observed CpG to expected CpG ratio of 0.6. CpG islands la-3a were classified based on a CpG island being a 500-bp 
region of genomic DNA with >55% CG content and an observed CpG to expected CpG ratio of 0.65. Images are drawn to scale. (D) Differentially 
methylated and paternally hypermethylated CpGs island 4/la within gap 2 using bisulfite allelic sequencing and gDNA from mouse-human cell 
hybrid cell lines or diploid human cells. 
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Figure 4. Conservation analysis of human chr 20 gaps. The coverage versus pairwise sequence identity of 23 organisms are shown as scatterplots for 
gap 1 (A), gap 2 (B) and gap 3 (C). Primates are clearly separated from more distant organisms. In the phylogenetic trees, gap 1 (D), gap 2 (E) and 
gap 3 (F) are all seen to be clustered together with primate species. Phylogenetic trees (generated from the 24-way multiple alignments) show all three 
human gaps regions tightly clustered with primates. 



candidates each are in gap 2 (numbers 1, 4, 7 and 8) and 
gap 3 (numbers 2, 3, 10 and 11). 

Candidate ncRNA numbers 10 and 11 originated from 
the same genomic region in gap 3. Using BLASTN 
(E: <le" 15r ) we found no significant overlap in sequence 
between the ncRNA candidates and coding sequences 
anywhere else in the human genome (hgl8 genome 
assembly and data not shown). We initially validated ex- 
pression of all 1 1 CMfinder-predicted structured ncRN As 
using universal human reference total RNA. Using qPCR 
and gel electrophoresis we identified the sizes of all 
products amplified to be of the predicted size 
(Figure 5A). To assess tissue specificity of four randomly 
chosen candidate ncRNAs we selected two candidate 
ncRNAs each from gap 1 (numbers 5 and 9) and gap 2 
(numbers 7 and 8) and assessed their tissue-wide expres- 
sion using total RNA from a multi-tissue human panel. 
Both ncRNA candidate numbers 5 and 9 are expressed in 
a tissue-specific manner and predominantly in human 



brain regions (Figure 5B). Candidate numbers 7 and 8 
are similarly expressed in a tissue-specific manner in 
tissues including the colon and kidney but at low levels 
in the brain (Figure 5B). Since ncRNA candidate numbers 
5, 6 and 9 are expressed from the DLGAP4 locus known 
to be highly expressed in the rat brain (49), we quantified 
their expression in the human brain. We found all three 
candidate ncRNAs to be widely expressed in the human 
brain with elevated expression in the cerebellum, occipital 
and temporal lobes (Figure 5C). Their covariant CMfinder 
predicted ncRNA structures are conserved in the same 
organisms including macaque, mouse, rat, cow and dog 
(Figure 5D). The predicted structures and conservation of 
all putative ncRNAs are listed in Supplementary Tables 
S4 and S5 excluding candidate ncRNA number 8 as it had 
a weakly conserved predicted secondary structure despite 
of its statistical significance calculated by INFERNAL. 
We determined that the DLGAP4 mRNA isoform is 
expressed specifically in human brain samples 
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Figure 5. De novo prediction and experimental verification of structured ncRNAs within human chr 20 gaps. (A) Expression confirmation of 
ncRNAs predicted by CMfinder using total universal reference RNA. (B) Expression of candidate ncRNAs in a human multi-tissue RNA panel. 
(C) Elevated expression of candidate ncRNAs: 5, 6, 9 and 12 in a human multi-region brain panel. (D) RNA structure of ncRNA numbers: 5, 6, 9 
predicted using INFERNAL. The Vienna RNA conservation coloring scheme highlights the mutational pattern with respect to the structure and a 
circle around a variable base(s) marks consistent and compensatory mutations. The color coding is explained in a color scheme legend. There are six 
different canonical base pairs (G-C, C-G, A-U, U-A, G-U, U-G) described by the different colors (.\-axis), and if the base pair is not supported in a 
sequence of the alignment the color shading gets lighter (y-axis, incompatible pairs). (E) Candidate ncRNA number 5 is specifically expressed in the 
nucleus of Purkinje cells in the human cerebellum. (F) Localized view of the DLGAP4 promoter (chr 20:34,355,500-34,385,499, hgl8 + chr 20 gaps) 
showing candidate ncRNAs in gap 1. RNA-Seq tracks show forward and reverse aligned reads in blue and green, respectively. 



(Supplementary Figure S4A). Using qPCR we found no 
inclusion of sequences from gap 1 transcribed into the 
mRNA sequence of DLGAP4 isoform a (data not 
shown). Subsequently, we characterized the spatial and 



temporal expressions of ncRNA candidate number 5 in 
gap 1, expressed from within this brain-specific DLGAP4 
locus in the human brain. This candidate ncRNA number 
5 expressed from the sense strand was highly expressed in 
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Purkinje neurons in the cerebellum similar to the 
DLGAP4 mRNA isoform a (Figure 5E, data not 
shown). Importantly, we found candidate ncRNA 
number 5 to be expressed in a specifically localized 
manner in the nucleus of Purkinje neurons (Figure 5E) 
attributing it a possible regulatory role. We ruled out 
the possibility of these candidate ncRNAs as being add- 
itional non-coding exons of DLGAP4 mRNA isoform as 
no sequence from the gap was transcribed to mature 
DLGAP4 mRNA isoform a transcript (data not shown). 

We next asked if any other transcribed elements could 
be identified through an independent high-throughput 
RNA-Seq approach. We found some overlap of predicted 
structured ncRNAs within gap 1 in RNA samples from 
different regions of the human brain (Figure 5F, 
Supplementary Figure S4A and Supplementary Tables 
S5 and S6). However, most regions predicted using 
CMfinder as candidate ncRNAs were not identified in 
the RNA-Seq data sets. Furthermore, we found one add- 
itional transcript (ncRNA candidate number 12, 
Figure 5F and Supplementary Table S6) expressed from 
the sense strand in gap 1 and in different regions of the 
brain. We also found one additional transcript in gap 2 
specifically expressed in RNA from PBLs (termed candi- 
date ncRNA number 13, Supplementary Figure S4B and 
Supplementary Table S6) using the RNA-Seq BodyMap 
2.0 data set. Candidate ncRNA numbers 12 and 13 were 
identified using RNA-Seq and not by CMfinder/ 
INFERNAL predictions. Quantification of ncRNA 
number 12 in our multi-region human brain panel 
identified it to be elevated in expression in brain regions 
that was similar to the expression pattern of the other 
ncRNA candidate's numbers 5, 6 and 9 identified previ- 
ously (Figure 5C). RNA-Seq data also identified ncRNA 
candidate number 13 expressed in PBLs to have two po- 
tential exons with total size of ~680bp (Supplementary 
Figure S4C). None of the candidate ncRNAs identified 
here was present in any repetitive elements 
(Supplementary Table S2). 



DISCUSSION 

In this study, we have successfully sequenced three previ- 
ously unfinished gaps on the q-arm of human chr 20. With 
the addition of these 52 866 sequenced nucleotides, the full 
size and sequence of chr 20 is now complete with the ex- 
ception of short repeat stretches in gap 1 (~348bp) and 
gap 3 (~413bp). We determined the complete sizes of all 
three remaining gaps on human chr 20 as 9485 bp (gap 1), 
27 759 bp (gap 2) and 16 383 bp (gap 3) using gDNA from 
multiple normal individuals using Sanger sequencing and 
high-throughput mate pair sequencing. This confirms the 
robustness of the methodologies used here for sequencing 
of previously unfinished human genome gaps. Analysis at 
the outset in BLASTN of these human chr 20 gaps while 
including the data from the Celera and HuRef assemblies 
revealed that sequence present in one assembly (Celera or 
HuRef) was sometimes identified as a gap in the other 
assembly or vice versa. Most of the remaining unfinished 
regions in the Celera and HuRef assemblies were 



complemented each other and were similarly sized to 
what we obtained in this study but were not present at 
all in the hgl8 or hgl9 genome assemblies. Although the 
hgl9/GRC37 genome assembly was released to include 
further sequencing coverage of the human genome, no 
sequence representation for these chr 20 gaps was made 
(50). A similar approach may be applied to re-sequence 
and close other remaining unfinished gaps of the human 
genome. However, for this approach to be successful there 
should be no segmental duplications flanking the immedi- 
ate gap region on either side. In addition, some sequence 
should be represented in either the publicly released Celera 
or HuRef genome assemblies. Although sequencing of 
these gaps is not trivial, a systematic approach based on 
a combination of techniques used here and in other studies 
(4,7) applied together, may help in closing the remaining 
unfinished human genome euchromatin gaps. 

Evidence from previous studies has shown gap regions 
to be rich in purine and/or pyrimidine stretches that are 
known to form Z-DNA structures (51). However, we 
found long stretches of TG and CT repeats in two of 
the three remaining human chr 20-gap regions. The repeti- 
tive nature of these sequences render them problematic to 
cloning techniques including those used in the human 
genome project (1). Gap 1 is comprised of >500bp of 
GT and > 300 bp of CT continuous nucleotide repeats. 
Gap 2 contains long repeats, however, gap 3 contains a 
>400bp CCATC stretch. Some of these sequences are 
present in the publicly released Celera genome assembly, 
suggesting the shotgun approach may have assisted in 
their sequencing and assembly (52). The latter is supported 
by our ability to align human mate pair paired-end 
genomic reads from multiple and independent experiments 
to these gap sequences. We did try to sequence the remain- 
ing > 500 bp regions in gaps 1 and 3 using unmapped 
mates from our Illumina sequence assemblies and one 
other independent source (SRA accession ERX009608), 
however, we were unable to obtain enough coverage to 
close them. 

We characterized five novel CpG islands within gap 2 
in gDNA from PBLs and human cerebellum tissue, three 
of which are GC rich. We also identified CpG island 
4/ la to be differentially methylated and paternally 
hypermethylated. Human chr 20 has several maternally 
imprinted genes most of which are paternally expressed 
(44,45). Since we found CpG island 4/la in gap 2 to be 
paternally methylated, we speculate that expression at this 
locus may occur from its maternally inherited 
hypomethylated allele in a tissue-specific manner. 
Evidence from recent studies has shown tissue-specific ex- 
pression of certain loci in the nuclear genome may be 
regulated by cis/ trans acting ncRNAs. Emerging func- 
tional roles for ncRNAs in regulating the expression of 
loci highly expressed in the brain and involved in 
regulating pluripotency and neurogenesis is also 
becoming increasingly evident (53,54). Taking these obser- 
vations into consideration, we searched human chr 20 
gaps for ncRNAs using a combination of computational 
analyses and high-throughput RNA-Seq. We confirmed 
expression of a total of 13 (11 CMfinder/INFERNAL 
predicted structured ncRNAs and 2 identified using 
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RNA-Seq) candidate ncRNAs in these 3 human chr 20 
gaps. CMfinder and INFERNAL have been used to 
predict conserved RNA secondary motif structures else- 
where in the human genome (38,40). We found most of 
these predicted ncRNAs to have structural similarity to 
other RNA structures transcribed elsewhere in the 
human genome, most of which were also found in other 
vertebrates. 

Almost any genomic locus can theoretically fold into 
thermodynamically predicted RNA secondary structures, 
which is what RNAfold calculates. It has also been shown 
that secondary structure alone is statistically insignificant 
for the detection of putative ncRNAs (55). On the other 
hand, agreement in the literature exists for comparative 
sequence analysis and comparison to structural back- 
ground as a means to discover real and/or functional 
ncRNAs (38,56,57). Therefore, the applied prediction 
method using CMfinder accounts for thermodynamics 
and compensatory base pair changes (A-U in human 
and G-C in mouse) in structurally re-aligned sequences 
from 18 different vertebrate genomes and compares the 
signal to the structural background for finding statistically 
significant ncRNA structures. 

Quantification of ncRNA numbers 5, 6, 9 and 12 in the 
human brain panel identified all four to be elevated in the 
human cerebellum. Furthermore, we found ncRNA 
number 5 to be expressed in specific regions of Purkinje 
neuron cells in adult human cerebellum. These candidate 
ncRNAs are expressed at elevated levels in the brain and 
from within the DLGAP4 locus coding for the 
brain-specific DLGAP4 mRNA isoform a. Their elevated 
expression in the human brain and high clustering in 
certain regions of the Purkinje cells including the nucleus 
may in fact drive expression of DLGAP4 mRNAs or other 
mRNAs in the brain. We speculate that putative ncRNAs 
characterized here may have roles in the regulation of 
disease-linked loci at 20q (9,10). Since we found candidate 
ncRNA numbers 12 and 13 using high-throughput 
RNA-Seq and not by prediction-based methods, we 
suggest a combination of approaches should be used to 
identify new putative ncRNAs in other regions of the 
human genome as well. Northern blot analysis using 
total RNA from whole-human brain and human cerebel- 
lum tissues together with randomly or end-labeled PCR 
amplified probes corresponding to three putative 
structured ncRNA numbers 5, 6 and 9 failed to reveal 
their size (data not shown). This may be due to the 
small length of each probe (<250bp) used and the fact 
that this was based solely on the size of the predicted 
structural part. Furthermore, these ncRNAs are expressed 
at relatively low levels compared to mRNAs as evidenced 
by qPCR and may exist as multiple transcripts of varying 
length or spliced variants making them more difficult to 
detect by northern blot analysis. Difficulties associated 
with the confirmation of ncRNAs by northern blot 
analysis have previously been reported (58,59). 
Analyzing flanking regions on either side of these three 
human chr 20 gaps identified them as being conserved in 
primates and having evolutionary conserved elements 
within them. From an evolutionary perspective this area 
of chr 20 is dense in genes associated with development 



and its rather low conservation suggests functions unique 
to primates and especially Homo sapiens. Further work 
would be necessary to understand the specific nature and 
functional significance of these expressed transcripts, if 
any. In conclusion, our study shows that unfinished eu- 
chromatic gaps in the human genome may harbor specific 
epigenetic domains and contain conserved genomic 
elements with regulatory potential, including differentially 
methylated CpG-islands and other functional elements, as 
well as ncRNA genes, the function of which is now open 
for further study. 
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